Results 1 - 6 of 6
1.
bioRxiv ; 2024 Feb 28.
Article in English | MEDLINE | ID: mdl-38464295

ABSTRACT

Deep learning has made rapid advances in modeling molecular sequencing data. Despite achieving high performance on benchmarks, it remains unclear to what extent deep learning models learn general principles and generalize to previously unseen sequences. Benchmarks traditionally interrogate model generalizability by generating metadata-based (MB) or sequence-similarity-based (SB) train and test splits of input data before assessing model performance. Here, we show that this approach mischaracterizes model generalizability by failing to consider the full spectrum of cross-split overlap, i.e., similarity between train and test splits. We introduce Spectra, a spectral framework for comprehensive model evaluation. For a given model and input data, Spectra plots model performance as a function of decreasing cross-split overlap and reports the area under this curve as a measure of generalizability. We apply Spectra to 18 sequencing datasets with associated phenotypes ranging from antibiotic resistance in tuberculosis to protein-ligand binding to evaluate the generalizability of 19 state-of-the-art deep learning models, including large language models, graph neural networks, diffusion models, and convolutional neural networks. We show that SB and MB splits provide an incomplete assessment of model generalizability. With Spectra, we find that as cross-split overlap decreases, deep learning models consistently exhibit a reduction in performance in a task- and model-dependent manner. Although no model consistently achieved the highest performance across all tasks, we show that deep learning models can generalize to previously unseen sequences on specific tasks. Spectra paves the way toward a better understanding of how foundation models generalize in biology.
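
The "area under the performance curve" idea described in this abstract can be illustrated with a minimal sketch. This is not the Spectra implementation: the function name `spectral_auc`, the choice of a 0-to-1 overlap scale, the trapezoidal rule, and the sample values are all assumptions made for illustration only.

```python
def spectral_auc(overlaps, scores):
    """Trapezoidal area under a performance-vs-overlap curve.

    overlaps: cross-split overlap of each train/test split (assumed 0..1).
    scores:   model performance measured on each of those splits.
    Both names and the 0..1 scale are illustrative assumptions.
    """
    if len(overlaps) != len(scores) or len(overlaps) < 2:
        raise ValueError("need at least two (overlap, score) pairs")
    # Sort by overlap so the integration order does not depend on input order.
    points = sorted(zip(overlaps, scores))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical measurements: performance drops as overlap decreases.
overlaps = [1.0, 0.75, 0.5, 0.25, 0.0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5]
auc = spectral_auc(overlaps, scores)
print(round(auc, 3))
```

Under these assumptions, a model whose performance stays flat as overlap shrinks would score a larger area than one whose performance collapses, which is the sense in which the area summarizes generalizability.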

2.
Nat Mach Intell ; 5(4): 340-350, 2023 Apr.
Article in English | MEDLINE | ID: mdl-38076673

ABSTRACT

Artificial intelligence for graphs has achieved remarkable success in modeling complex systems, ranging from dynamic networks in biology to interacting particle systems in physics. However, the increasingly heterogeneous graph datasets call for multimodal methods that can combine different inductive biases (the set of assumptions that algorithms use to make predictions for inputs they have not encountered during training). Learning on multimodal datasets presents fundamental challenges because the inductive biases can vary by data modality and graphs might not be explicitly given in the input. To address these challenges, multimodal graph AI methods combine different modalities while leveraging cross-modal dependencies using graphs. Diverse datasets are combined using graphs and fed into sophisticated multimodal architectures, specified as image-intensive, knowledge-grounded and language-intensive models. Using this categorization, we introduce a blueprint for multimodal graph learning, use it to study existing methods and provide guidelines to design new models.

3.
Nat Commun ; 13(1): 3817, 2022 Jul 02.
Article in English | MEDLINE | ID: mdl-35780211

ABSTRACT

Long diagnostic wait times hinder international efforts to address antibiotic resistance in M. tuberculosis. Pathogen whole genome sequencing, coupled with statistical and machine learning models, offers a promising solution. However, generalizability and clinical adoption have been limited by a lack of interpretability, especially in deep learning methods. Here, we present two deep convolutional neural networks that predict antibiotic resistance phenotypes of M. tuberculosis isolates: a multi-drug CNN (MD-CNN) that predicts resistance to 13 antibiotics based on 18 genomic loci, with AUCs of 82.6-99.5% and higher sensitivity than state-of-the-art methods; and a set of 13 single-drug CNNs (SD-CNN) with AUCs of 80.1-97.1% and higher specificity than the previous state-of-the-art. Using saliency methods to evaluate the contribution of input sequence features to the SD-CNN predictions, we identify 18 sites in the genome not previously associated with resistance. The CNN models permit functional variant discovery, biologically meaningful interpretation, and clinical applicability.


Subject(s)
Mycobacterium tuberculosis; Tuberculosis; Anti-Bacterial Agents; Drug Resistance, Bacterial/genetics; Humans; Mutation; Mycobacterium tuberculosis/genetics; Neural Networks, Computer; Tuberculosis/drug therapy; Tuberculosis/genetics
4.
NPJ Breast Cancer ; 7(1): 147, 2021 Nov 29.
Article in English | MEDLINE | ID: mdl-34845230

ABSTRACT

Histopathologic evaluation of biopsy slides is a critical step in diagnosing and subtyping breast cancers. However, the connections between histology and multi-omics status have never been systematically explored or interpreted. We developed weakly supervised deep learning models over hematoxylin-and-eosin-stained slides to examine the relations between visual morphological signal, clinical subtyping, gene expression, and mutation status in breast cancer. We first designed fully automated models for tumor detection and pathology subtype classification, with the results validated in independent cohorts (area under the receiver operating characteristic curve ≥ 0.950). Using only visual information, our models achieved strong predictive performance in estrogen/progesterone/HER2 receptor status, PAM50 status, and TP53 mutation status. We demonstrated that these models learned lymphocyte-specific morphological signals to identify estrogen receptor status. Examination of the PAM50 cohort revealed a subset of PAM50 genes whose expression reflects cancer morphology. This work demonstrates the utility of deep learning-based image models in both clinical and research regimes, through its ability to uncover connections between visual morphology and genetic statuses.

5.
Lancet Microbe ; 2(3): e96-e104, 2021 Mar.
Article in English | MEDLINE | ID: mdl-33912853

ABSTRACT

BACKGROUND: Mycobacterium tuberculosis whole genome sequencing (WGS) data can provide insights into temporal and geographical trends in resistance acquisition and inform public health interventions. We aimed to use a large clinical collection of M tuberculosis WGS and resistance phenotype data to study how, when, and where resistance was acquired on a global scale. METHODS: We did a retrospective analysis of WGS data. We curated a set of clinical M tuberculosis isolates with high-quality sequencing and culture-based drug susceptibility data (spanning four lineages and 52 countries in Africa, Asia, the Americas, and Europe) using public databases and literature curation. For inclusion, sequence quality criteria and country of origin data were required. We constructed geographical and lineage-specific M tuberculosis phylogenies and used Bayesian molecular dating with BEAST, version 1.10.4, to infer the most recent common susceptible ancestor age for 4869 instances of resistance to ten drugs. FINDINGS: Between Jan 1, 1987, and Sept 12, 2014, of 10 299 M tuberculosis clinical isolates, 8550 were curated, of which 6099 (71%) from 15 countries met criteria for molecular dating. The number of independent resistance acquisition events was lower than the number of resistant isolates across all countries, suggesting ongoing transmission of drug resistance. Ancestral age distributions supported the presence of old resistance, 20 years or more before, in most countries. A consistent order of resistance acquisition was observed globally starting with resistance to isoniazid, but resistance ancestral age varied by country. We found a direct correlation between gross domestic product per capita and resistance age (r²=0·47; p=0·014). Amplification of fluoroquinolone and second-line injectable resistance among multidrug-resistant isolates is estimated to have occurred very recently (median ancestral age 4·7 years [IQR 1·9-9·8] before sample collection). We found the sensitivity of commercial molecular diagnostics for second-line resistance to vary significantly by country (p<0·0003). INTERPRETATION: Our results highlight that both resistance transmission and amplification are contributing to disease burden globally but vary by country. The observation that wealthier nations are more likely to have old resistance (most recent common susceptible ancestor >20 years before isolation) suggests that programmatic improvements can reduce resistance amplification, but the fact that fit resistant strains can circulate for decades afterwards implies the need for continued surveillance.


Subject(s)
Mycobacterium tuberculosis; Tuberculosis, Lymph Node; Tuberculosis, Multidrug-Resistant; Antitubercular Agents/pharmacology; Bayes Theorem; Drug Resistance, Multiple, Bacterial; Humans; Microbial Sensitivity Tests; Mycobacterium tuberculosis/genetics; Retrospective Studies; Tuberculosis, Lymph Node/drug therapy; Tuberculosis, Multidrug-Resistant/drug therapy
6.
Front Neurol ; 12: 784250, 2021.
Article in English | MEDLINE | ID: mdl-35145468

ABSTRACT

BACKGROUND: Stroke is a leading cause of mortality globally. New therapies must undergo safety and efficacy testing in clinical trials, which operate in a limited timeframe. To maximize the impact of these trials, patient cohorts for whom ischemic stroke is likely during that designated timeframe should be identified. Machine learning may improve upon existing candidate identification methods in order to maximize the impact of clinical trials for stroke prevention and treatment and improve patient safety. METHODS: A retrospective study was performed using 41,970 qualifying patient encounters with ischemic stroke from inpatient visits recorded from over 700 inpatient and ambulatory care sites. Patient data were extracted from electronic health records and used to train and test a gradient boosted machine learning algorithm (MLA) to predict the patients' risk of experiencing ischemic stroke from the period of 1 day up to 1 year following the patient encounter. The primary outcome of interest was the occurrence of ischemic stroke. RESULTS: After training for optimization, XGBoost obtained a specificity of 0.793, a positive predictive value (PPV) of 0.194, and a negative predictive value (NPV) of 0.985. The MLA further obtained an area under the receiver operating characteristic curve (AUROC) of 0.88. The logistic regression and multilayer perceptron models both achieved AUROCs of 0.862. Features that significantly impacted the prediction of ischemic stroke included previous stroke history, age, and mean systolic blood pressure. CONCLUSION: MLAs have the potential to more accurately predict the near-term risk of ischemic stroke within a 1-year prediction window for individuals who have been hospitalized. This risk stratification tool can be used to design clinical trials to test stroke prevention treatments in high-risk populations by identifying subjects who would be more likely to benefit from treatment.
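
The specificity, PPV, and NPV reported in this abstract are all ratios of confusion-matrix counts. The sketch below shows the standard definitions; the function name `binary_metrics` and the counts passed to it are hypothetical and are not taken from the study.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard confusion-matrix ratios for a binary classifier.

    tp/fp/tn/fn are counts of true positives, false positives,
    true negatives, and false negatives. Returns NaN for any
    ratio whose denominator is zero.
    """
    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),  # fraction of actual positives caught
        "specificity": ratio(tn, tn + fp),  # fraction of actual negatives cleared
        "ppv": ratio(tp, tp + fp),          # fraction of positive calls that are right
        "npv": ratio(tn, tn + fn),          # fraction of negative calls that are right
    }


# Hypothetical counts for a rare outcome (60 positives among 1000 encounters).
m = binary_metrics(tp=40, fp=160, tn=780, fn=20)
print(m)
```

With a rare outcome, as in these made-up counts, a low PPV can sit alongside a high NPV, because even a modest false-positive rate yields many more false positives than true positives.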
